Choose embedding when data is primarily read together with its parent, has a one-to-few relationship, and doesn't frequently change independently; choose referencing for one-to-many/many-to-many relationships, frequently changing data, or when you need to query the related data in isolation.
The decision between embedding (denormalization) and referencing (normalization) in MongoDB is a critical schema design choice that directly impacts performance, consistency, and scalability. Unlike relational databases where normalization is often the default, MongoDB is designed to make embedding a first-class citizen. The right choice depends on analyzing how your application accesses and modifies data.
Embedding is the act of placing related data directly inside a parent document. You should choose embedding when your data naturally forms a one-to-few relationship. Classic examples include an orders document containing multiple addresses (shipping and billing) or a post document containing a few comments. In these cases, the embedded data is almost always accessed as part of the parent document, making retrieval extremely fast with a single database read. You also choose embedding when atomic updates are required—because a single document operation is atomic, you can update both the parent and its embedded children in one go, ensuring consistency .
Referencing (or linking) stores related data in separate collections and includes references (like ObjectIds) in the parent document. This is appropriate for one-to-many or many-to-many relationships. For example, an author can have hundreds of books, or a product can be part of many orders. When the related data is frequently accessed on its own, referencing is better. If you need to query all books independently of their authors, storing them in a separate books collection with an authorId reference is the right choice . Referencing also helps with write scalability—if thousands of comments are being added to a post, updating a separate comments collection is more efficient than continually growing and potentially exceeding the 16MB document size limit.
How do you access the data? If you almost always retrieve the related data together with the parent (e.g., a user and their profile), embedding is more efficient . If you often access related data independently, use referencing.
What is the cardinality? For one-to-few (like addresses or recent sessions), embed. For one-to-many (like orders or comments) or many-to-many, reference .
How often does the data change? If the embedded data changes frequently and independently of the parent, referencing can reduce write amplification . For example, a product price that changes often would be better referenced so you update it once, and all documents referencing it see the change.
What are your consistency requirements? Embedding allows atomic updates to the parent and all related data. With referencing, you need transactions or application-level logic to maintain consistency across collections .
Will the data grow unbounded? Embedded documents have a hard 16MB size limit. If an array can grow indefinitely (e.g., user comments, log entries), referencing is the only viable choice .
Are you duplicating data that needs to be consistent? Denormalization (embedding) often duplicates data. If you embed the same product details in every order document and the product price changes, you must update every order containing that product—an expensive operation .
MongoDB's official documentation suggests starting with embedding by default—it's usually the right choice for many use cases . Only move to referencing when you encounter a specific limitation: unbounded growth, excessive data duplication requiring consistency, or when you need to query the related data independently. This 'embed-first' approach often leads to simpler, more performant applications that fit naturally with MongoDB's document model.
Sometimes the best solution is a hybrid approach. You might embed frequently accessed summary data while referencing full details. For example, an orders collection could embed a snapshot of product details at the time of purchase (price, name) while keeping the full product catalog separate. This gives you fast reads for order history without worrying about product price changes affecting past orders. This is a form of denormalization that balances read performance with write complexity.